The pulldown experiments let us identify proteins localised to the presynaptic active zone, which we are interested in studying. Beyond identifying the proteins themselves, the experiments also yield a couple of measures that can be useful for predicting whether proteins interact:
This notebook looks at the second of these two:
The affinity measure extracted from this data is based on the paper by Xie et al. They call their co-complexed score the C2S score; it has a probabilistic basis, which is ideal for our use. All we need to do is extract a value for each protein pair we're looking at and see what kind of coverage these values give us.
Thanks to the convenient structure of the file we are using, we can extract the feature with the functionality of ocbio.extract. Each row holds a protein pair in Entrez Gene ID format together with the corresponding C2S value.
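A file with this row layout can be read with plain csv; a minimal sketch (the sample rows and IDs below are hypothetical stand-ins for the real file):

```python
import csv
import io

# Hypothetical sample rows: two Entrez Gene IDs and a C2S value, tab-separated
sample = "2902\t2903\t7.31\n2904\t2905\t0.42\n"

c2s = {}
for row in csv.reader(io.StringIO(sample), delimiter="\t"):
    # store the pair as a frozenset so (a, b) and (b, a) map to the same score
    pair = frozenset(row[:2])
    c2s[pair] = float(row[2])

print(c2s[frozenset(["2903", "2902"])])
```

Keying on a frozenset means the lookup is order-independent, which matters because a pair may appear in either orientation.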
In [2]:
    
cd ../../
    
    
Unfortunately, running the ocbio.extract code naively leaves many protein pairs without a value, giving poor coverage.
To improve the coverage we will first use the zeromissinginternal option of the FeatureVectorAssembler.
This replaces with zeros any interactions between proteins which the database knows about but has no stored value for.
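The zeromissinginternal behaviour can be illustrated with a toy lookup (the protein IDs and stored value here are hypothetical, not from the real database):

```python
# Sketch of the zeromissinginternal idea: pairs of proteins the database
# knows about, but has no stored value for, get a zero instead of "missing".
known_proteins = {"2902", "2903", "2904"}
stored = {frozenset(["2902", "2903"]): 7.31}

def lookup(a, b):
    pair = frozenset([a, b])
    if pair in stored:
        return stored[pair]
    if a in known_proteins and b in known_proteins:
        return 0.0          # both proteins known, value just not recorded
    return "missing"        # at least one protein unknown to the database

print(lookup("2902", "2903"))  # stored value
print(lookup("2903", "2904"))  # zero-filled
print(lookup("2902", "9999"))  # genuinely missing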
Secondly, we will interpolate the expected value of this feature from the values of other features with better coverage. To do this, we first generate a set of feature vectors over the bait and prey combinations and use them to train a linear regression model.
We can then pickle the trained regressor, along with the data-source table it needs to create its feature vectors, and pass both as parser options in the code that generates the feature vectors for the classification task.
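Pickling a trained regressor is a simple round-trip; a minimal sketch on toy data (the training arrays below are invented for illustration):

```python
import pickle

import numpy as np
from sklearn.linear_model import Ridge

# Train a small regressor on toy data, then round-trip it through pickle,
# mirroring how a trained interpolator could be stored for later use.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])
model = Ridge(alpha=1.0).fit(X, y)

blob = pickle.dumps(model)
restored = pickle.loads(blob)
print(restored.predict([[4.0]]))
```

The restored object predicts identically to the original, so the pickle can stand in for the trained model anywhere downstream.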
Loading the features which were used for classification in the pipeline prototype notebook:
In [2]:
    
import sys,csv
    
In [4]:
    
sys.path.append("opencast-bio/")
    
In [5]:
    
import ocbio.extract
    
In [6]:
    
reload(ocbio.extract)
    
    Out[6]:
In [7]:
    
!git annex unlock datasource.proto.tab
    
    
In [8]:
    
f = open("datasource.proto.tab","w")
c = csv.writer(f,delimiter="\t")
# the HIPPIE feature
c.writerow(["HIPPIE/hippie_current.txt","HIPPIE/feature.HIPPIE.2.db","protindexes=(1,3);valindexes=(4);zeromissing=1"])
# Gene Ontology features
c.writerow(["Gene_Ontology","Gene_Ontology","generator=geneontology/testgen.pickle"])
# iRefIndex features
c.writerow(["iRefIndex","iRefIndex","generator=iRefIndex/human.iRefIndex.Entrez.1ofk.pickle"])
# STRING features
c.writerow(["STRING","STRING","generator=string/human.Entrez.string.pickle"])
# ENTS summary feature
c.writerow(['ENTS_summary','ENTS_summary','generator=ents/human.Entrez.ENTS.summary.pickle'])
f.close()
    
In [16]:
    
assembler = ocbio.extract.FeatureVectorAssembler("datasource.proto.tab", verbose=True)
    
    
In [12]:
    
assembler.regenerate(verbose=True)
    
    
In [17]:
    
assembler.assemble("forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv",
                   "features/pulldown.interactions.interpolate.vectors.txt", verbose=True)
    
    
In [18]:
    
f = open("datasource.affinity.tab","w")
c = csv.writer(f,delimiter="\t")
# just the affinity feature
c.writerow(["affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv",
            "affinityresults/results2/affinity.Entrez.db","zeromissinginternal=1"])
f.close()
    
In [7]:
    
!git annex unlock affinityresults/results2/affinity.Entrez.db
    
In [8]:
    
assembler = ocbio.extract.FeatureVectorAssembler("datasource.affinity.tab", verbose=True)
    
    
In [9]:
    
assembler.assemble("forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv",
                   "features/pulldown.interactions.interpolate.affinity.targets.txt", verbose=True)
    
    
Scikit-learn has an implementation of linear regression we can train to interpolate missing values. Specifically, we will use the ridge regression model to avoid overfitting on this large dataset. We will also use 10-fold cross-validation and look at the mean squared error on the test set to estimate the effectiveness of the model.
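The ridge-plus-10-fold-cross-validation recipe can be sketched end to end on synthetic data (the feature matrix, target weights, and noise level below are invented; this also uses the modern sklearn.model_selection API rather than the older cross_validation module used in the cells below):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Toy stand-in for the feature vectors and affinity targets
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.randn(200)

mses = []
for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = Ridge(alpha=1.0).fit(X[train], y[train])
    mses.append(mean_squared_error(y[test], model.predict(X[test])))

print("mean CV MSE: {0:.4f}".format(np.mean(mses)))
```

With noise of standard deviation 0.1, a well-fit model should achieve a mean squared error near 0.01; a much larger value would signal underfitting or a bug.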
In [3]:
    
import sklearn.linear_model
    
In [4]:
    
#load vectors into memory:
#loading X vectors
X = loadtxt("features/pulldown.interactions.interpolate.vectors.txt")
    
In [3]:
    
#loading y targets
y = loadtxt("features/pulldown.interactions.interpolate.affinity.targets.txt",dtype=str)
    
In [4]:
    
missinglabels = y == "missing"
    
In [5]:
    
#zero missing values
y[missinglabels] = 0.0
    
In [6]:
    
#convert to float
y = y.astype(np.float)
    
In [9]:
    
import sklearn.utils
    
In [10]:
    
X,y = sklearn.utils.shuffle(X,y)
    
In [11]:
    
import sklearn.cross_validation
    
In [12]:
    
kf = sklearn.cross_validation.KFold(y.shape[0],10)
    
In [13]:
    
for train,test in kf:
    print train,test
    
    
In [29]:
    
scores = []
for train,test in kf:
    #split the data
    X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
    #train the classifier
    linreg = sklearn.linear_model.LinearRegression()
    linreg.fit(X_train,y_train)
    #test the classifier
    scores.append(linreg.score(X_test,y_test))
    
In [30]:
    
print scores
    
    
In [31]:
    
mean(scores)
    
    Out[31]:
According to the documentation:
The coefficient R^2 is defined as (1 - u/v), where u is the regression sum of squares ((y_true - y_pred) ** 2).sum() and v is the residual sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values are worse.
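The quoted definition can be checked directly by hand (the toy arrays below are invented for illustration):

```python
import numpy as np

# Compute R^2 directly from the quoted definition
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

u = ((y_true - y_pred) ** 2).sum()          # sum of squared prediction errors
v = ((y_true - y_true.mean()) ** 2).sum()   # spread of the targets themselves
r2 = 1 - u / v
print(r2)
```

A score of 0 means the model is no better than predicting the mean, and a negative score means it is worse.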
So this isn't a very good regression model. Let's try an ensemble method, such as an AdaBoost regressor.
In [15]:
    
import sklearn.ensemble
    
In [17]:
    
adascores = []
for train,test in kf:
    #split the data
    X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
    #train the classifier
    ada = sklearn.ensemble.AdaBoostRegressor()
    ada.fit(X_train,y_train)
    #test the classifier
    adascores.append(ada.score(X_test,y_test))
    break
    
We had to break the above loop after one iteration because it was taking too long to run.
In [18]:
    
adascores
    
    Out[18]:
So the regression sum of squares must be three times larger than the residual sum of squares. Obviously, that's worse than the above.
In [24]:
    
Xycov = cov(X.T,y.T)
    
In [28]:
    
print Xycov.shape
    
    
In [38]:
    
print Xycov[:,-1][Xycov[:,-1]>1]
    
    
So several of the features appear to be strongly correlated with this feature; it is worth checking that this actually is the case.
Trying a ridge regressor, with different values for alpha:
In [123]:
    
ridgescores = []
for train,test in kf:
    #split the data
    X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
    #train the classifier
    ridge = sklearn.linear_model.Ridge(alpha=1.0)
    ridge.fit(X_train,y_train)
    #test the classifier
    ridgescores.append(ridge.score(X_test,y_test))
    break
    
In [47]:
    
ridgescores
    
    Out[47]:
Still not better than simple linear regression.
In [49]:
    
plot(X[:,-2][:100],y[:100])
    
    Out[49]:
    
In [61]:
    
import time
    
In [125]:
    
N = 100
    
In [191]:
    
XN = X[:,:][N-100:N]
plot(XN/amax(XN,axis=0))
plot(y[N-100:N]/max(abs(y[N-100:N])), label="target")
ypred = ridge.predict(X[N-100:N,:])
plot(ypred/max(abs(ypred)))
legend()
N += 100
    
    
In [159]:
    
amax(XN,axis=0)
    
    Out[159]:
In [156]:
    
len(amax(X,axis=0))
    
    Out[156]:
So it looks like there is not much point in this approach; it would be just as useful to simply fill missing values with an average value.
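Mean imputation is straightforward to sketch (the target values below are invented; "missing" mirrors the label used in the cells above):

```python
import numpy as np

# Fill missing affinity values with the mean of the observed ones
y = np.array(["7.3", "missing", "2.1", "missing"], dtype=object)

observed = np.array([float(v) for v in y if v != "missing"])
filled = np.array([float(v) if v != "missing" else observed.mean() for v in y])
print(filled)
```

Note the mean is taken only over the observed values, not over a zero-filled array, so the imputed value is not biased downward by the missing entries.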
In [7]:
    
print "Average value of pulldown affinity score over pulldown data: {0}".format(mean(y))
    
    
In [8]:
    
import pickle
    
In [9]:
    
f = open("affinityresults/results2/affinity.pulldown.average.pickle","wb")
pickle.dump([mean(y)],f)
f.close()